Towards Accurate Word Segmentation for Chinese Patents

Authors

Si Li

Nianwen Xue

Abstract

A patent is a property right for an invention granted by the government to the inventor. An invention is a solution to a specific technological problem. Patents therefore often have a high concentration of scientific and technical terms that are rare in everyday language. A Chinese word segmentation model trained on currently available everyday-language data sets performs poorly on patents because it cannot effectively recognize these scientific and technical terms. In this paper we describe a pragmatic approach to Chinese word segmentation on patents, in which we train a character-based semi-supervised sequence labeling model by extracting features from a manually segmented corpus of 142 patents, enhanced with information extracted from the Chinese TreeBank. Experiments show that our model reaches an F1 score of 95.08% on a held-out test set and 96.59% on the development set, compared with an F1 score of 91.48% on the development set if the model is trained on the Chinese TreeBank. We also experimented with some existing domain adaptation techniques; the results show that the amount of target domain data and the selected features impact the performance of the domain adaptation techniques.
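To make the terms "character-based sequence labeling" and "F1 score" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a B/M/E/S tag set (begin, middle, end, single), which the abstract does not specify, and it scores segmentation by exact word-span matches. The sample sentence and all function names are illustrative and are not taken from the paper or its corpus.

```python
def words_to_tags(words):
    """Convert a segmented sentence into per-character B/M/E/S tags (assumed tag set)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their tags; a word closes on E or S."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:  # flush an unfinished word at the end of the sentence
        words.append(buf)
    return words


def word_spans(words):
    """Return the set of (start, end) character offsets of each word."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1: a predicted word counts only if both boundaries match gold."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence only; not drawn from the patent corpus.
gold = ["本发明", "涉及", "一", "种", "装置"]
pred = ["本", "发明", "涉及", "一种", "装置"]
tags = words_to_tags(gold)
f1 = segmentation_f1(gold, pred)
```

In this toy example only two of the five predicted words match the gold boundaries, so precision and recall are both 2/5. A sequence labeling model of the kind the abstract describes would predict the tag sequence from character features, after which `tags_to_words` recovers the segmentation.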


Similar resources

Effective Document-Level Features for Chinese Patent Word Segmentation

A patent is a property right for an invention granted by the government to the inventor. Patents often have a high concentration of scientific and technical terms that are rare in everyday language. However, some scientific and technical terms usually appear with high frequency only in one specific patent. In this paper, we propose a pragmatic approach to Chinese word segmentation on patents wh...


Neural Regularized Domain Adaptation for Chinese Word Segmentation

For Chinese word segmentation, the large-scale annotated corpora mainly focus on newswire, and only a handful of annotated data is available in other domains such as patents and literature. Considering the limited amount of annotated target domain data, it is a challenge for segmenters to learn domain-specific information while avoiding overfitting at the same time. In this paper, we propose...


Chinese Word Segmentation and Information Retrieval

In this paper we present results of experiments with Chinese word segmentation and information retrieval. Our experiments with three different word segmentation algorithms indicate that accurate segmentation measurably improves retrieval performance. We discuss the evaluation of word segmentation algorithms for the purpose of better indexing segmented texts for retrieval.


Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in...


Fast and Accurate Neural Word Segmentation for Chinese

Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation. However, both training and working procedures of the current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate the existing drawba...



Journal:

CoRR

Volume: abs/1611.10038

Publication year: 2016
